Principal component analysis (PCA), principal component regression (PCR), and sparse PCA in R

1 Some preparation and exploratory analysis

SO2: SO2 content of air in micrograms per cubic metre
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year

rm(list = ls())
#packages <- c("HSAUR2", "elasticnet")
#if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
#  install.packages(setdiff(packages, rownames(installed.packages())), dependencies = TRUE)
#}
library(HSAUR2)
## Loading required package: tools
head(USairpollution)
##             SO2 temp manu popul wind precip predays
## Albany       46 47.6   44   116  8.8  33.36     135
## Albuquerque  11 56.8   46   244  8.9   7.77      58
## Atlanta      24 61.5  368   497  9.1  48.34     115
## Baltimore    47 55.0  625   905  9.6  41.31     111
## Buffalo      11 47.1  391   463 12.4  36.11     166
## Charleston   31 55.2   35    71  6.5  40.75     148
attach(USairpollution)
## The following object is masked from package:datasets:
## 
##     precip

To begin we shall ignore the SO2 variable and concentrate on the others, two of which relate to human ecology (popul, manu) and four to climate (temp, wind, precip, predays). A case can be made for using negative temperature values in subsequent analyses, since then all six variables are such that high values represent a less attractive environment. This is, of course, a personal view, but as we shall see later, this simple transformation of temp does aid interpretation.

Source: An Introduction to Applied Multivariate Analysis with R by Brian Everitt and Torsten Hothorn, p. 86

USairpollution$negtemp <- temp*(-1)
USairpollution <- USairpollution[,c(1,8,3:7)]
attach(USairpollution)
## The following objects are masked from USairpollution (pos = 3):
## 
##     manu, popul, precip, predays, SO2, wind
## The following object is masked from package:datasets:
## 
##     precip
head(USairpollution)
##             SO2 negtemp manu popul wind precip predays
## Albany       46   -47.6   44   116  8.8  33.36     135
## Albuquerque  11   -56.8   46   244  8.9   7.77      58
## Atlanta      24   -61.5  368   497  9.1  48.34     115
## Baltimore    47   -55.0  625   905  9.6  41.31     111
## Buffalo      11   -47.1  391   463 12.4  36.11     166
## Charleston   31   -55.2   35    71  6.5  40.75     148

Check the sample correlation matrix and draw the scatterplot matrix.

round(cor(USairpollution[,-1]),2) # sample correlation matrix
##         negtemp  manu popul  wind precip predays
## negtemp    1.00  0.19  0.06  0.35  -0.39    0.43
## manu       0.19  1.00  0.96  0.24  -0.03    0.13
## popul      0.06  0.96  1.00  0.21  -0.03    0.04
## wind       0.35  0.24  0.21  1.00  -0.01    0.16
## precip    -0.39 -0.03 -0.03 -0.01   1.00    0.50
## predays    0.43  0.13  0.04  0.16   0.50    1.00
plot(USairpollution[,-1]) # matrix scatterplot
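The variables are measured on very different scales (compare popul, in thousands, with wind, in miles per hour), which is why the PCA below is based on the correlation matrix (cor = TRUE). A quick check of the raw variances makes this concrete:

```r
# Raw variances differ by orders of magnitude; an unstandardized
# (covariance-based) PCA would be dominated by manu and popul alone.
sapply(USairpollution[, -1], var)
```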


2 Principal Component Analysis

QoL: Quality of Life

“The first component might be regarded as some index of “quality of life”, with high values indicating a relatively poor environment (in the authors’ opinion at least).

The second component is largely concerned with a city’s rainfall having high coefficients for precip and predays and might be labelled as the “wet weather” component.

Component three is essentially a contrast between precip and negtemp and will separate cities having high temperatures and high rainfall from those that are colder but drier. A suitable label might be simply “climate type”.”

usair.pc<-princomp(USairpollution[,-1],cor=TRUE) 
summary(usair.pc,loadings=TRUE)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.4819456 1.2247218 1.1809526 0.8719099 0.33848287
## Proportion of Variance 0.3660271 0.2499906 0.2324415 0.1267045 0.01909511
## Cumulative Proportion  0.3660271 0.6160177 0.8484592 0.9751637 0.99425879
##                             Comp.6
## Standard deviation     0.185599752
## Proportion of Variance 0.005741211
## Cumulative Proportion  1.000000000
## 
## Loadings:
##         Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## negtemp  0.330  0.128  0.672  0.306  0.558  0.136
## manu     0.612 -0.168 -0.273  0.137  0.102 -0.703
## popul    0.578 -0.222 -0.350                0.695
## wind     0.354  0.131  0.297 -0.869 -0.113       
## precip          0.623 -0.505 -0.171  0.568       
## predays  0.238  0.708         0.311 -0.580
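Before looking at the scores, a scree plot of the component variances helps decide how many components to retain (a sketch using base R's screeplot; the dashed line marks the Kaiser criterion):

```r
screeplot(usair.pc, type = "lines", main = "Scree plot")
abline(h = 1, lty = 2)  # Kaiser criterion: retain components with variance > 1
```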
usair.pc$scores[,1:3]
##                      Comp.1      Comp.2      Comp.3
## Albany         -0.538945616  0.79206938  1.36260519
## Albuquerque    -1.417093145 -2.86589024  1.27540034
## Atlanta        -0.598994755  0.58723563 -0.99541128
## Baltimore       0.509380032  0.02875397 -0.36359953
## Buffalo         1.390708909  1.88030104  1.77619191
## Charleston     -1.429753320  1.21058365 -0.07944350
## Chicago         6.513881753 -1.66838172 -2.28629587
## Cincinnati     -0.508220797  0.48601020 -0.26618169
## Cleveland       1.766279326  1.03942658  0.74664124
## Columbus       -0.118835030  0.64039358  0.42302559
## Dallas         -0.006878799 -1.21181090 -0.99830140
## Denver         -0.207434109 -1.96320936  1.26621359
## Des Moines     -0.131980292 -0.06120607  1.65010856
## Detroit         2.166694857 -0.27060292  0.14704284
## Hartford       -0.219106349  0.97630584  0.59461329
## Houston         0.508419801 -0.11271579 -1.99360553
## Indianapolis    0.308324007  0.36049244  0.28541539
## Jacksonville   -1.227849872  0.84912801 -1.87611109
## Kansas City    -0.131303464 -0.25205976  0.27549603
## Little Rock    -1.611599761  0.34248684 -0.83970812
## Louisville     -0.423739264  0.54055336 -0.37446946
## Memphis        -0.577848813  0.32506654 -1.11488311
## Miami          -1.533160553  1.40469861 -2.60660585
## Milwaukee       1.391024518  0.15761872  1.69127813
## Minneapolis     1.500032786  0.24675569  1.75081328
## Nashville      -0.910052287  0.54346445 -0.85923113
## New Orleans    -1.453857066  0.90075225 -1.99187430
## Norfolk        -0.589141903  0.75219052 -0.06092797
## Omaha          -0.133637672 -0.38478358  1.23581021
## Philadelphia    2.797074676 -0.65847826 -1.41511547
## Phoenix        -2.440096802 -4.19114925 -0.94155229
## Pittsburgh      0.322264830  1.02663680  0.74808213
## Providence      0.069936477  1.03390711  0.88774002
## Richmond       -1.171998946  0.33494902 -0.50862036
## Salt Lake City -0.912393969 -1.54734758  1.56510204
## San Francisco  -0.502073845 -2.25528717  0.22663991
## Seattle         0.481679438  1.59742576  0.60871204
## St. Louis       0.286187330 -0.38438239 -0.15567183
## Washington     -0.022928417 -0.05456742 -0.35387289
## Wichita        -0.196823158 -0.67607441  1.13122963
## Wilmington     -0.996140738  0.50074082  0.43332129
par(pty="s")
plot(usair.pc$scores[,1],usair.pc$scores[,2],
     ylim=range(usair.pc$scores[,1]),
     xlab="QoL",ylab="Wet weather",type="n",lwd=2)
text(usair.pc$scores[,1],usair.pc$scores[,2],
     labels=abbreviate(row.names(USairpollution)),cex=0.7,lwd=2)

par(pty="s")
plot(usair.pc$scores[,1],usair.pc$scores[,3],
     ylim=range(usair.pc$scores[,1]),
     xlab="QoL",ylab="Climate type",type="n",lwd=2)
text(usair.pc$scores[,1],usair.pc$scores[,3],
     labels=abbreviate(row.names(USairpollution)),cex=0.7,lwd=2)

par(pty="s")
plot(usair.pc$scores[,2],usair.pc$scores[,3],
     ylim=range(usair.pc$scores[,2]),
     xlab="Wet weather",ylab="Climate type",type="n",lwd=2)
text(usair.pc$scores[,2],usair.pc$scores[,3],
     labels=abbreviate(row.names(USairpollution)),cex=0.7,lwd=2)
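The three pairwise score plots can also be condensed into a single biplot, which overlays the variable loadings on the scores for the first two components (a minimal sketch using base R's biplot):

```r
# grey labels: cities (scores); black arrows: variables (loadings)
biplot(usair.pc, col = c("grey40", "black"), cex = 0.6)
```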

3 PCR

par(mfrow=c(1,3))
plot(usair.pc$scores[,1],SO2,xlab="PC1")
plot(usair.pc$scores[,2],SO2,xlab="PC2")
plot(usair.pc$scores[,3],SO2,xlab="PC3")

usair.pcr <- lm(SO2~usair.pc$scores[,1]+usair.pc$scores[,2]+usair.pc$scores[,3])
summary(usair.pcr)
## 
## Call:
## lm(formula = SO2 ~ usair.pc$scores[, 1] + usair.pc$scores[, 2] + 
##     usair.pc$scores[, 3])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.420 -10.981  -3.184  12.087  61.273 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            30.049      2.907  10.336 1.85e-12 ***
## usair.pc$scores[, 1]    9.942      1.962   5.068 1.14e-05 ***
## usair.pc$scores[, 2]    2.240      2.374   0.943    0.352    
## usair.pc$scores[, 3]   -0.375      2.462  -0.152    0.880    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.62 on 37 degrees of freedom
## Multiple R-squared:  0.4182, Adjusted R-squared:  0.371 
## F-statistic: 8.866 on 3 and 37 DF,  p-value: 0.0001473
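Only PC1 is significant at the 5% level. Because principal component scores are mutually uncorrelated, PC2 and PC3 can be dropped without changing the PC1 coefficient; a sketch of the reduced fit:

```r
usair.pcr1 <- lm(SO2 ~ usair.pc$scores[, 1])
# the PC1 estimate is unchanged (orthogonal regressors);
# R-squared drops only by the small PC2/PC3 contributions
summary(usair.pcr1)
```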

4 Big data in PCA

Another data set: “mnist_train.txt”

We only use the handwritten digit 3.

number <- read.table("mnist_train.txt", 
                   header = F, sep = ",")
dim(number)
## [1] 60000   785
number_3 <- number[which(number[,1]==3),]
dim(number_3)
## [1] 6131  785
number_mat <- matrix(as.numeric(number_3[3,-1]), 28, 28, byrow = T)
mat1 <- apply(number_mat, 2, rev)
image(1:28, 1:28, t(mat1), axes = FALSE, xlab = "", ylab = "", 
      col = grey(seq(1, 0, length = 255)))

Some data reduction:

number_ <- number_3[,which(!apply(number_3,2,FUN = function(x){all(x == 0)}))]

number_$V1 <- NULL
dim(number_)
## [1] 6131  585

cor=FALSE indicates that we use the covariance matrix here instead of the correlation matrix, because the correlation matrix can only be used if there are no constant variables. In a handwriting data set like this one, constant variables (pixels that are zero in every image) are to be expected.
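Why a constant column breaks the correlation matrix can be seen in a two-line toy example (hypothetical data, not part of the MNIST set):

```r
x <- data.frame(a = rnorm(5), b = rep(0, 5))
cor(x)  # correlations involving b are NA, with a zero-standard-deviation warning
```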

out_pca <- princomp(number_, cor = FALSE)

Loadings of the first four PCs:

loadings_pca <- cbind(round(out_pca$loadings[,1], digits = 4), 
                      round(out_pca$loadings[,2], digits = 4), 
                      round(out_pca$loadings[,3], digits = 4), 
                      round(out_pca$loadings[,4], digits = 4))

# proportion of explained variance
out_pca$sdev["Comp.1"]^2/sum(out_pca$sdev^2)
##   Comp.1 
## 0.124696
out_pca$sdev["Comp.2"]^2/sum(out_pca$sdev^2)
##     Comp.2 
## 0.09268516
out_pca$sdev["Comp.3"]^2/sum(out_pca$sdev^2)
##     Comp.3 
## 0.07955255
out_pca$sdev["Comp.4"]^2/sum(out_pca$sdev^2)
##     Comp.4 
## 0.05490474
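The four ratios above can be obtained in one step from the vector of standard deviations; their cumulative sum shows that the first four components together explain only about 35% of the total variance:

```r
prop_var <- out_pca$sdev^2 / sum(out_pca$sdev^2)
round(cumsum(prop_var)[1:4], 4)
```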

Map the loadings back onto the original 784 pixel positions (the constant pixels dropped above keep a loading of zero):

# pixel positions 1..784 of the non-constant columns (V1 is the label,
# so column i of `number` corresponds to pixel i - 1)
used_pca <- which(names(number)[-1] %in% colnames(number_))

pc <- matrix(0, nrow = 4, ncol = 784)
v <- 1

for (i in used_pca) {
  pc[, i] <- loadings_pca[v, ]
  v <- v + 1
}

number_mat <- matrix(as.numeric(number_3[3,-1]), 28, 28, byrow = T) 
mat1 <- apply(number_mat, 2, rev)

col <- colorRampPalette(c("white", "blue"))
col2 <- colorRampPalette(c("white", "red"))

for (i in 1:4) {
  
  pca_mat <- matrix(as.numeric(pc[i,]), 28, 28, byrow = T) 
  mat_pca <- apply(pca_mat, 2, rev) 
  
  
  # blue: positive loadings, red: negative loadings
  A <- t(mat1)
  B <- (A>1) * t(mat_pca)
  B[ B<=0.0 ] <- NA 
  
  image(1:28, 1:28, A, axes = FALSE, xlab = "", ylab = "", col = grey(seq(1, 0, length = 255)))
  image(1:28, 1:28, B, axes = FALSE, xlab = "", ylab = "", 
        col= col(10), add=T)  
  
  A <- t(mat1)
  B <- (A>1) * t(mat_pca)
  B[ B>=0.0 ] <- NA 
  
  image(1:28, 1:28, B, axes = FALSE, xlab = "", ylab = "", 
        col= col2(10), add=T) 
  
  B <- t(mat_pca)
  B[A>1] <-NA 
  B[ B>=0.00 ] <- NA 
  image(1:28, 1:28, B, axes = FALSE, xlab = "", ylab = "", 
        col= col(10), add=T)  # overlay
  
  B <- t(mat_pca)
  B[A>1] <-NA 
  B[ B<=0.00 ] <- NA 
  image(1:28, 1:28, B, axes = FALSE, xlab = "", ylab = "", 
        col= col2(10), add=T)  # overlay 
  
}

The first four principal components explain only a small share of the information here.

With only about a third of the variance explained, we cannot see a clear structure in the positive or negative loadings of the principal components on the individual variables. Alternative techniques exist for such high-dimensional data, but they are not covered in this lecture.

Important point: we need to understand what principal component analysis does, and for what reasons we might use it.
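A direct way to see how much (or how little) the first components capture is to reconstruct an image from them. The sketch below rebuilds the third digit from k = 4 components, using the standard princomp output (center, scores, loadings) and mapping the non-constant pixels back to their positions:

```r
k <- 4
idx <- which(names(number)[-1] %in% colnames(number_))  # pixel positions 1..784
recon <- out_pca$center + out_pca$scores[3, 1:k] %*% t(out_pca$loadings[, 1:k])
full <- numeric(784)
full[idx] <- recon          # the all-zero pixels stay zero
recon_mat <- matrix(full, 28, 28, byrow = TRUE)
image(1:28, 1:28, t(apply(recon_mat, 2, rev)), axes = FALSE,
      xlab = "", ylab = "", col = grey(seq(1, 0, length = 255)))
```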

5 Factor Analysis

We consider the concept of manifest and latent variables

#packages <- c("psych", "corrplot", "GPArotation")
#if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
#  install.packages(setdiff(packages, rownames(installed.packages())), dependencies = TRUE)
#}

library(MASS)

# import data
bfi <- readRDS("bfi.rds")
head(bfi)
##       A2 A3 A5 C1 C2 C3 C4 C5 N1 N2 N3
## 61617  4  3  4  2  3  3  4  4  3  4  2
## 61618  4  5  5  5  4  4  3  4  3  3  3
## 61620  4  5  4  4  5  4  2  5  4  5  4
## 61621  4  6  5  4  4  3  5  5  2  5  2
## 61622  3  3  5  4  4  5  3  2  2  3  4
## 61623  6  5  5  6  6  6  1  3  3  5  2
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked _by_ '.GlobalEnv':
## 
##     bfi
?bfi # info on the original questionnaire data

# corrplot
library(corrplot)
## corrplot 0.84 loaded
corr <- cor(na.omit(bfi))
corrplot(corr)
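Reordering the variables by hierarchical clustering (a built-in option of corrplot) makes the block structure of the A-, C-, and N-items easier to see:

```r
corrplot(corr, order = "hclust")
```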

Some useful functions for factor analysis

### factanal() - stats
#?factanal
# - Estimation: ML only
# - Rotation: varimax (orthogonal) & promax (oblique)

### fa() - psych
#?fa
# - Estimation: 'minres' - minimum residual
#               'ml'     - ML
#               'pa'     - principal axes
#               and several other methods
# - Rotation: huge variety
library('GPArotation')
f <- fa(bfi, nfactors = 4, rotate = "oblimin", fm = "ml")
fa.organize(f)
## Factor Analysis using method =  ml
## Call: fa(r = bfi, nfactors = 4, rotate = "oblimin", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      ML3   ML4   ML1   ML2   h2   u2 com
## A2  0.61  0.07  0.03  0.02 0.38 0.62 1.0
## A3  0.80 -0.03 -0.02  0.04 0.63 0.37 1.0
## A5  0.62  0.00 -0.01 -0.12 0.43 0.57 1.1
## C1 -0.01  0.64  0.05 -0.02 0.38 0.62 1.0
## C2  0.02  0.69  0.01  0.05 0.47 0.53 1.0
## C3  0.06  0.44 -0.14  0.01 0.29 0.71 1.3
## C4  0.01 -0.44  0.26  0.12 0.41 0.59 1.8
## C5 -0.01 -0.01  0.91  0.00 0.85 0.15 1.0
## N1 -0.01 -0.05 -0.07  0.87 0.74 0.26 1.0
## N2 -0.01  0.04  0.04  0.82 0.68 0.32 1.0
## N3  0.03  0.04  0.10  0.64 0.44 0.56 1.1
## 
##                        ML3  ML4  ML1  ML2
## SS loadings           1.42 1.36 1.03 1.89
## Proportion Var        0.13 0.12 0.09 0.17
## Cumulative Var        0.13 0.25 0.35 0.52
## Proportion Explained  0.25 0.24 0.18 0.33
## Cumulative Proportion 0.25 0.49 0.67 1.00
## 
##  With factor correlations of 
##       ML3   ML4   ML1   ML2
## ML3  1.00  0.25 -0.21 -0.16
## ML4  0.25  1.00 -0.49 -0.09
## ML1 -0.21 -0.49  1.00  0.31
## ML2 -0.16 -0.09  0.31  1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  2.91 with Chi Square of  8124.83
## The degrees of freedom for the model are 17  and the objective function was  0.03 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  2761 with the empirical chi square  43.29  with prob <  0.00043 
## The total number of observations was  2800  with Likelihood Chi Square =  75.18  with prob <  2.7e-09 
## 
## Tucker Lewis Index of factoring reliability =  0.977
## RMSEA index =  0.035  and the 90 % confidence intervals are  0.027 0.043
## BIC =  -59.75
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML3  ML4  ML1  ML2
## Correlation of (regression) scores with factors   0.87 0.85 0.93 0.92
## Multiple R square of scores with factors          0.75 0.72 0.86 0.86
## Minimum correlation of possible factor scores     0.51 0.43 0.72 0.71
fa.diagram(f)

### Interpretation:
# F1: N1/N2/N3: 'emotional fluctuation'
# F2: A2/A3/A5: 'empathy'
# F3: C1/C2/C3: 'perfectionism'
#                -> C3 and C4 have small loadings, not included here
# F4: C5: 'wasting my time', hard to interpret
#
# -> Conclusion: in terms of content, this split of the C-items does not seem reasonable


### (i) Deciding on the number of factors
fa.parallel(bfi)

## Parallel analysis suggests that the number of factors =  3  and the number of components =  3
# -> Compare versions with 2, 3, and 4 factors
f2 <- fa(bfi, nfactors = 2, rotate = "oblimin", fm = "ml")
f3 <- fa(bfi, nfactors = 3, rotate = "oblimin", fm = "ml")
f4 <- fa(bfi, nfactors = 4, rotate = "oblimin", fm = "ml")
fa.diagram(f2)

fa.diagram(f3)

fa.diagram(f4)

# -> 3 factors most reasonable
### (ii) Rotation techniques
# orthogonal
f_varimax <- fa(bfi, nfactors = 3, rotate = "varimax", fm = "ml")
f_quartimax <- fa(bfi, nfactors = 3, rotate = "quartimax", fm = "ml")
f_equamax <- fa(bfi, nfactors = 3, rotate = "equamax", fm = "ml")
# oblique
f_oblimin <- fa(bfi, nfactors = 3, rotate = "oblimin", fm = "ml")
f_promax <- fa(bfi, nfactors = 3, rotate = "promax", fm = "ml")

fa.diagram(f_varimax, main = "Varimax")

fa.diagram(f_quartimax, main = "Quartimax")

fa.diagram(f_equamax, main = "Equamax")

fa.diagram(f_oblimin, main = "Oblimin")

fa.diagram(f_promax, main = "Promax")

### Conclusion:
# 1) All orthogonal techniques yield essentially identical results, and the
#    same holds among the oblique ones. This is true quite generally: absent
#    specific constraints, it suffices to run one orthogonal and one oblique
#    FA and compare the results.
# 2) Here the orthogonal and oblique solutions are also nearly identical.
#    Reason: the factors are almost uncorrelated, as the oblimin and promax
#    plots show. The two families would differ if the factors were correlated.
# -> Conclusion: an orthogonal rotation would be fine here.
# -> We keep using oblimin
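The visual impression that the rotated solutions agree can be quantified with factor congruence coefficients (psych::factor.congruence); entries close to ±1 indicate matching factors across the two solutions:

```r
factor.congruence(f_varimax, f_oblimin)
```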

Reliability was first introduced into psychometrics by Spearman in 1904; it refers to the degree of consistency or dependability of test results. Depending on the focus of interest, reliability can be divided into internal and external reliability.

Internal reliability concerns whether a group of questionnaire items measures the same concept, i.e., how internally consistent these items are. The most common internal reliability measures are Cronbach's alpha and split-half reliability. The most common external reliability measure is test-retest reliability: the same questionnaire is administered to the same subjects at different points in time, and the degree of agreement is computed.

In practice, many constructs and attitudes cannot be measured directly, so researchers design a group of items to reflect them indirectly. Whether these items actually serve the research goal has to be checked statistically. In the study cited below, for example, the researchers designed six items to measure employees' work enthusiasm indirectly and wanted to judge their consistency. Cronbach's α analysis can be used in such a situation.

Explanation: Cronbach's α analysis is mainly used to assess the consistency of continuous and ordered categorical variables, which fits the data of that study.

Source: https://www.sohu.com/a/204514060_489312
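The formula behind Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), is simple enough to compute by hand. `cronbach` below is a hypothetical helper for illustration, not a psych function:

```r
# illustrative helper (not part of psych): Cronbach's alpha from the definition
cronbach <- function(items) {
  items <- na.omit(items)  # complete cases only
  k <- ncol(items)
  k / (k - 1) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))
}
cronbach(bfi[, c("A2", "A3", "A5")])
# close to the raw_alpha from psych::alpha below; small differences
# come from missing-data handling
```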

#?alpha

### A-Items
alpha(bfi[,c("A2","A3","A5")])
## 
## Reliability analysis   
## Call: alpha(x = bfi[, c("A2", "A3", "A5")])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean sd median_r
##       0.72      0.72    0.64      0.46 2.6 0.0091  4.7  1     0.49
## 
##  lower alpha upper     95% confidence boundaries
## 0.7 0.72 0.74 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## A2      0.67      0.67    0.51      0.51 2.0    0.012    NA  0.51
## A3      0.56      0.56    0.39      0.39 1.3    0.017    NA  0.39
## A5      0.65      0.65    0.49      0.49 1.9    0.013    NA  0.49
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean  sd
## A2 2773  0.77  0.78  0.60   0.51  4.8 1.2
## A3 2774  0.84  0.83  0.70   0.59  4.6 1.3
## A5 2784  0.79  0.79  0.62   0.52  4.6 1.3
## 
## Non missing response frequency for each item
##       1    2    3    4    5    6 miss
## A2 0.02 0.05 0.05 0.20 0.37 0.31 0.01
## A3 0.03 0.06 0.07 0.20 0.36 0.27 0.01
## A5 0.02 0.07 0.09 0.22 0.35 0.25 0.01
# -> acceptable (alpha = 0.72)
### N-Items
alpha(bfi[,c("N1","N2","N3")])
## 
## Reliability analysis   
## Call: alpha(x = bfi[, c("N1", "N2", "N3")])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
##       0.82      0.82    0.76       0.6 4.6 0.006  3.2 1.3     0.56
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.82 0.83 
## 
##  Reliability if an item is dropped:
##    raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## N1      0.71      0.71    0.55      0.55 2.4   0.0110    NA  0.55
## N2      0.71      0.71    0.56      0.56 2.5   0.0108    NA  0.56
## N3      0.83      0.83    0.71      0.71 4.8   0.0065    NA  0.71
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean  sd
## N1 2778  0.88  0.88  0.80   0.72  2.9 1.6
## N2 2779  0.87  0.88  0.80   0.71  3.5 1.5
## N3 2789  0.82  0.82  0.65   0.60  3.2 1.6
## 
## Non missing response frequency for each item
##       1    2    3    4    5    6 miss
## N1 0.24 0.24 0.15 0.19 0.12 0.07 0.01
## N2 0.12 0.19 0.15 0.26 0.18 0.10 0.01
## N3 0.18 0.23 0.13 0.21 0.16 0.09 0.00
# -> fine (alpha = 0.82)
### C-Items
# Important: C4 and C5 load negatively -> these items must be reverse-coded first
bfi_new <- bfi
bfi_new$C4_rev <- 7 - bfi_new$C4
bfi_new$C5_rev <- 7 - bfi_new$C5
alpha(bfi_new[,c("C1","C2","C3","C4_rev","C5_rev")])
## 
## Reliability analysis   
## Call: alpha(x = bfi_new[, c("C1", "C2", "C3", "C4_rev", "C5_rev")])
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.73      0.73    0.69      0.35 2.7 0.0081  4.3 0.95     0.34
## 
##  lower alpha upper     95% confidence boundaries
## 0.71 0.73 0.74 
## 
##  Reliability if an item is dropped:
##        raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## C1          0.69      0.70    0.64      0.36 2.3   0.0093 0.0037  0.35
## C2          0.67      0.67    0.62      0.34 2.1   0.0099 0.0056  0.34
## C3          0.69      0.69    0.64      0.36 2.3   0.0096 0.0070  0.36
## C4_rev      0.65      0.66    0.60      0.33 2.0   0.0107 0.0037  0.32
## C5_rev      0.69      0.69    0.63      0.36 2.2   0.0096 0.0017  0.35
## 
##  Item statistics 
##           n raw.r std.r r.cor r.drop mean  sd
## C1     2779  0.65  0.67  0.54   0.45  4.5 1.2
## C2     2776  0.70  0.71  0.60   0.50  4.4 1.3
## C3     2780  0.66  0.67  0.54   0.46  4.3 1.3
## C4_rev 2774  0.74  0.73  0.64   0.55  4.4 1.4
## C5_rev 2784  0.72  0.68  0.57   0.48  3.7 1.6
## 
## Non missing response frequency for each item
##           1    2    3    4    5    6 miss
## C1     0.03 0.06 0.10 0.24 0.37 0.21 0.01
## C2     0.03 0.09 0.11 0.23 0.35 0.20 0.01
## C3     0.03 0.09 0.11 0.27 0.34 0.17 0.01
## C4_rev 0.02 0.08 0.16 0.17 0.29 0.28 0.01
## C5_rev 0.10 0.17 0.22 0.12 0.20 0.18 0.01
# -> acceptable (alpha = 0.73)

6 Structural Equation Modelling

rm(list = ls())

library(sem)
library(semPlot)
library(corrplot)
library(DiagrammeR)

# Duncan, Haller, and Portes's nonrecursive peer-influences model 
 
R.DHP <- readMoments(diag=FALSE, names=c('ROccAsp', 'REdAsp', 'FOccAsp', 
                                         'FEdAsp', 'RParAsp', 'RIQ', 'RSES',
                                         'FSES', 'FIQ', 'FParAsp'), text = "
.6247                                                 
.3269  .3669                                                
.4216  .3275  .6404                               
.2137  .2742  .1124  .0839                          
.4105  .4043  .2903  .2598  .1839                     
.3240  .4047  .3054  .2786  .0489  .2220                
.2930  .2407  .4105  .3607  .0186  .1861  .2707           
.2995  .2863  .5191  .5007  .0782  .3355  .2302  .2950      
.0760  .0702  .2784  .1988  .1147  .1021  .0931 -.0438  .2087
") 
R.DHP
##         ROccAsp REdAsp FOccAsp FEdAsp RParAsp    RIQ   RSES    FSES    FIQ
## ROccAsp  1.0000 0.0000  0.0000 0.0000  0.0000 0.0000 0.0000  0.0000 0.0000
## REdAsp   0.6247 1.0000  0.0000 0.0000  0.0000 0.0000 0.0000  0.0000 0.0000
## FOccAsp  0.3269 0.3669  1.0000 0.0000  0.0000 0.0000 0.0000  0.0000 0.0000
## FEdAsp   0.4216 0.3275  0.6404 1.0000  0.0000 0.0000 0.0000  0.0000 0.0000
## RParAsp  0.2137 0.2742  0.1124 0.0839  1.0000 0.0000 0.0000  0.0000 0.0000
## RIQ      0.4105 0.4043  0.2903 0.2598  0.1839 1.0000 0.0000  0.0000 0.0000
## RSES     0.3240 0.4047  0.3054 0.2786  0.0489 0.2220 1.0000  0.0000 0.0000
## FSES     0.2930 0.2407  0.4105 0.3607  0.0186 0.1861 0.2707  1.0000 0.0000
## FIQ      0.2995 0.2863  0.5191 0.5007  0.0782 0.3355 0.2302  0.2950 1.0000
## FParAsp  0.0760 0.0702  0.2784 0.1988  0.1147 0.1021 0.0931 -0.0438 0.2087
##         FParAsp
## ROccAsp       0
## REdAsp        0
## FOccAsp       0
## FEdAsp        0
## RParAsp       0
## RIQ           0
## RSES          0
## FSES          0
## FIQ           0
## FParAsp       1

Display the correlation matrix as a plot:

corrplot(R.DHP)

model.DHP.2 <- specifyEquations(covs="RGenAsp, FGenAsp", text = "
RGenAsp = gam11*RParAsp + gam12*RIQ + gam13*RSES + gam14*FSES + beta12*FGenAsp
FGenAsp = gam23*RSES + gam24*FSES + gam25*FIQ + gam26*FParAsp + beta21*RGenAsp
ROccAsp = 1*RGenAsp
REdAsp = lam21(1)*RGenAsp  
FOccAsp = 1*FGenAsp
FEdAsp = lam42(1)*FGenAsp                                
") 
## NOTE: adding 4 variances to the model
sem.DHP.2 <- sem(model.DHP.2, S=R.DHP, N=329,
                 fixed.x=c('RParAsp', 'RIQ', 'RSES', 'FSES', 'FIQ', 'FParAsp'))
summary(sem.DHP.2)
## 
##  Model Chisquare =  26.69722   Df =  15 Pr(>Chisq) = 0.03130238
##  AIC =  64.69722
##  BIC =  -60.24365
## 
##  Normalized Residuals
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.79953 -0.11783  0.00000 -0.01201  0.03974  1.56525 
## 
##  R-square for Endogenous Variables
## RGenAsp FGenAsp ROccAsp  REdAsp FOccAsp  FEdAsp 
##  0.5220  0.6170  0.5879  0.6639  0.6888  0.5954 
## 
##  Parameter Estimates
##                    Estimate    Std Error  z value    Pr(>|z|)    
## gam11               0.16122238 0.03879229  4.1560415 3.238091e-05
## gam12               0.24964951 0.04398093  5.6763131 1.376288e-08
## gam13               0.21840339 0.04419737  4.9415476 7.750487e-07
## gam14               0.07183929 0.04970696  1.4452563 1.483859e-01
## beta12              0.18423232 0.09488787  1.9415793 5.218805e-02
## gam23               0.06188691 0.05171968  1.1965835 2.314690e-01
## gam24               0.22886711 0.04416218  5.1824232 2.190216e-07
## gam25               0.34903540 0.04528979  7.7067130 1.290997e-14
## gam26               0.15953413 0.03882593  4.1089578 3.974486e-05
## beta21              0.23547779 0.11938929  1.9723527 4.856936e-02
## lam21               1.06267767 0.09013865 11.7893677 4.428532e-32
## lam42               0.92972555 0.07028108 13.2286752 5.993451e-40
## V[RGenAsp]          0.28098694 0.04623153  6.0778199 1.218274e-09
## C[RGenAsp,FGenAsp] -0.02260935 0.05119391 -0.4416413 6.587488e-01
## V[FGenAsp]          0.26383537 0.04466688  5.9067334 3.489577e-09
## V[ROccAsp]          0.41214523 0.05122464  8.0458399 8.565593e-16
## V[REdAsp]           0.33614544 0.05209991  6.4519386 1.104283e-10
## V[FOccAsp]          0.31119476 0.04592712  6.7758385 1.236868e-11
## V[FEdAsp]           0.40460387 0.04618438  8.7606206 1.941783e-18
##                                        
## gam11              RGenAsp <--- RParAsp
## gam12              RGenAsp <--- RIQ    
## gam13              RGenAsp <--- RSES   
## gam14              RGenAsp <--- FSES   
## beta12             RGenAsp <--- FGenAsp
## gam23              FGenAsp <--- RSES   
## gam24              FGenAsp <--- FSES   
## gam25              FGenAsp <--- FIQ    
## gam26              FGenAsp <--- FParAsp
## beta21             FGenAsp <--- RGenAsp
## lam21              REdAsp <--- RGenAsp 
## lam42              FEdAsp <--- FGenAsp 
## V[RGenAsp]         RGenAsp <--> RGenAsp
## C[RGenAsp,FGenAsp] FGenAsp <--> RGenAsp
## V[FGenAsp]         FGenAsp <--> FGenAsp
## V[ROccAsp]         ROccAsp <--> ROccAsp
## V[REdAsp]          REdAsp <--> REdAsp  
## V[FOccAsp]         FOccAsp <--> FOccAsp
## V[FEdAsp]          FEdAsp <--> FEdAsp  
## 
##  Iterations =  32
pathDiagram(sem.DHP.2)

(Figure: path diagram without edge labels)

pathDiagram(sem.DHP.2,edge.labels="values")

(Figure: path diagram with edge labels)
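The semPlot package loaded at the start of this section has not been used yet; its semPaths function gives an alternative rendering of the fitted model (a sketch; the layout arguments may vary across semPlot versions):

```r
# plot the fitted sem model with estimated coefficients as edge labels
semPaths(sem.DHP.2, what = "est", layout = "tree", rotation = 2)
```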